1 Introduction to Dataset, Load and Clean Data and Load Libraries

1.1 Introduction to Dataset

Prosper is a website that connects lenders with borrowers. It takes data from the lenders like credit score, employment history, and the reason for getting the loan and provides it to lenders. Lenders can choose who they wish to loan money to. Prosper provides a dataset of over 113,000 loans that were made with over 80 data points on each loan. This investigation sorts through the dataset to find interesting trends and create (hopefully) insightful graphics depicting these trends.

1.2 Load Libraries and Data

library(ggplot2)
library(GGally)
library(RColorBrewer)
library(corrplot)
setwd("/home/nathan/Desktop/Nanodegree/P4 - Loan Data Visualization with R")
loan = read.csv('prosperLoanData.csv')

1.3 Transform Data

Listing data was provided as a numerical value with a separate legend. This section transforms it to a factor variable and adds back in the descriptions for each number. Also in this section the Occupation variable is reordered so that the boxplot later on is easier to read.

Current Delinquencies is cut into 4 buckets to make a variable that can be more easily used to show color on a plot. The table function is used before cut to determine how to equally size the buckets.

a <-loan$ListingCategory..numeric.
loan$ListingCategory[a == 0] <- "0 - Not Available"
loan$ListingCategory[a == 1] <- "1 - Debt Consolidation"
loan$ListingCategory[a == 2] <- "2 - Home Improvement"
loan$ListingCategory[a == 3] <- "3 - Business" 
loan$ListingCategory[a == 4] <- "4 - Personal Loan"  
loan$ListingCategory[a == 5] <- "5 - Student Use" 
loan$ListingCategory[a == 6] <- "6 - Auto"  
loan$ListingCategory[a == 7] <- "7- Other" 
loan$ListingCategory[a == 8] <- "8 - Baby&Adoption" 
loan$ListingCategory[a == 9] <- "9 - Boat" 
loan$ListingCategory[a == 10] <- "10 - Cosmetic Procedure" 
loan$ListingCategory[a == 11] <- "11 - Engagement Ring" 
loan$ListingCategory[a == 12] <- "12 - Green Loans"
loan$ListingCategory[a == 13] <- "13 - Household Expenses"
loan$ListingCategory[a == 14] <- "14 - Large Purchases"
loan$ListingCategory[a == 15] <- "15 - Medical/Dental"
loan$ListingCategory[a == 16] <- "16 - Motorcycle"
loan$ListingCategory[a == 17] <- "17 - RV"
loan$ListingCategory[a == 18] <- "18 - Taxes"
loan$ListingCategory[a == 19] <- "19 - Vacation"
loan$ListingCategory[a == 20] <- "20 - Wedding Loans"

loan$ListingCategory <- as.factor(loan$ListingCategory)

loan$ListingCategory <- with(loan, 
                             reorder(ListingCategory, BorrowerRate, median))

loan$Occupation <- with(loan, reorder(Occupation, BorrowerRate, median))

loan$CurDelBucket <- cut(loan$CurrentDelinquencies, c(-Inf,0,1,5,+Inf))

2 Univariate Exploration

After the libraries and data are loaded, the next step is to take a look at the list of variables in the data and plot single variables (univariate plots) on a histogram or box plot to get an idea of the range of different variables. This will give an idea of what could be explored later.

summary(loan$CreditScoreRangeLower)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591

Most credit scores for people getting loans on Prosper are between 600 and 800 with the mean around 685. It would be interesting to compare this trend to the credit scores of the general population.

summary(loan$BorrowerRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

The rate that most borrowers are charged looks like a normal distribution with mean around 19%. There are some outliers on the higher end of the spectrum that may need to be dealt with. This variable will be a key part of the bivariate and multivariate analysis because for a potential borrower, it is useful to know the factors that affect the interest rate that they will ultimately be paying.

summary(loan$EstimatedLoss)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.005   0.042   0.072   0.080   0.112   0.366   29084

The mean estimated loss is around 8% and the third quartile is 11.2%. This indicates that most loans lose less than 15%.

summary(loan$EstimatedReturn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.074   0.092   0.096   0.117   0.284   29084

Nearly all loans have a positive Estimated Return. This is to be expected because a lender would likely not make a load if the estimated return is negative.

summary(loan$OpenRevolvingAccounts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    4.00    6.00    6.97    9.00   51.00

This trend shows a positive skew. However the tail of the distribution is not long enough to use a log scale on the x axis to attempt to make the distribution more normal. The plot below changes the x axis to a logaritmic scale and you can see that the numbers are too clustered around 1 for it to transform the graph properly into a normal distribution.

summary(loan$OpenRevolvingMonthlyPayment)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   114.0   271.0   398.3   525.0 14980.0

The trend for open revolving monthly payment is similar the the revolving accounts trend where it is positively skewed and has a long tail. The values in this chart are sufficiently large that it responds better to a log 10 transformation of the x axis. The resulting distribution is below and is pretty close to a normal distribution.

summary(loan$AmountDelinquent)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0      0.0      0.0    984.5      0.0 463900.0     7622

Amount delinquent is also a very clear long tail trend. Below is the same plot with the x axis transformed to log10 scale. To create these plots, ggplot was used instead of the prebuilt function because the 0 values for AmountDelinquent had to be excluded from the data set.

In this case transforming the x axis to a log scale gives what appears to be a normal distribution. This means tools like z-scores to predict delinquent amount where a certain percentage of the sample would be below or above that value will work.

summary(loan$BankcardUtilization)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604

This trend has a fairly linear distribution and does not appear that it would respond well to a trasformation or the assumptions of a normal distribution.

summary(loan$DebtToIncomeRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

The debt to income ratio trend shows positive skew and this is confirmed by the fact that the mean (~28%) is much higher than the median (22%).

3 Bivariate Exploration

In this section the relationship between several pairs of variables will be plotted. Again this is to get an idea of what kinds of relationships exist in the data to see what trends are worth exploring further. The first plots will be a correlation matrices that plot several variables against each other in a matrix to see if there are any interesting relationships. The correlation matrices are targetted at the Borrower Rate and Credit Score variables. In the first 2 matrices, the borrower rate plotted against 9 other variables to see which relationships might be worth investigating further. The second 2 matrices plot credit score against several variables for the same purpose.

3.1 Borrower Rate Correlation Matrices

As shown in the correlation matrix, there is a very high correlation between Borrower Rate and Estimated Loss. This is expected because Borrowers that are seen as riskier will be given higher interest rates. The expected loss will also be higher for borrowers that are seen as risky. This is the expected relationship that economic theory predicts - higher risk (estimated loss) demands a higher return (interest rate).

Interestingly, Employment Duration (EmplDur), Revolving Monthly Payment (RevMnthPmt), Amount Delinquent (AmntDelinq) and Debt to Income Ratio (DtoIRatio) have no significant correlation with Borrower Rate. I would have predicted that Debt to Income Ratio would have some correlation with borrower rate as would seem to be a clear indicator of risk.

Borrower Rate has a moderate negative correlation with Credit Score (CScoreLow) (negative correlation). Credit Score is expected to be correlated because the credit score metric attempts to quantify how likely a borrower is to pay back a loan.

There are 2 variables appear to be correlated with Borrower Rate. 1. Bank Card Utilization has a weak positive correlation with Borrower Rate. This seems to make sense in that borrowers who have maxed out their current credit lines would seem to be a higher risk. However, if a borrower is only applying for their second or third credit line, they would seem to be less risky than someone applying for their seventh or eighth line of credit as they likely have less credit. This multivariate relationship will be studied more in section 1.4 Multivariate Exploration. 2. Available Bank Credit has a somewhat stronger negative correlation with Borrower Rate. This is interesting in that higher available credit correlates with a lower Borrower Rate and lower credit utilization correlates with a lower Borrower Rate. A deeper exploration of the multivariate relationship between amount of credit, credit utilization and borrower rate will follow in section 1.4.

3.2 Credit Score Correlation Matrices

Most variables in this plot show very weak or no correlation with credit score. These variables do not show much promise for continued exploration.

A number of the variables in this plot show weak to moderate correlations with credit score. The list below shows the variable name and the magnitude and direction of correlation with credit score. - Current Delinquencies - moderate, negative - Delinquencies Last 7 Years - moderate, negative - Bankcard Utilization - moderate, negative - Available Bankcard Credit - moderate, positive - Stated Monthly Income - very weak, positive

Some of these variables will be explored more deeply in section 1.4. This is a good list to start building a model for predicting credit scores.

3.3 Credit Score vs. Borrower Rate

cor.test(loan$CreditScoreRangeLower, loan$BorrowerRate)
## 
##  Pearson's product-moment correlation
## 
## data:  loan$CreditScoreRangeLower and loan$BorrowerRate
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

The plot seems to show a negative correlation as expected by the -0.46 correlation coefficient, but it is not that easy to see with the scatter plot. Adding the mean, median, 10% and 90% quantile summary lines would give a clearer indication of what is happening with these 2 variables.

The summary lines make the trend much clearer. The interest rate that the borrower pays (borrower rate) decreases as the credit score increases, however there is still a lot of variation in the interest rate for a given credit score. Credit score is clearly one component that determines the borrower rate but it does not tell the whole story. Other variables that correlate with borrower rate will be worth investigating, especially for cases where the credit score is below 600.

This plot also clearly shows that there are fewer loans established when the borrower has a credit score lower than 600 or higher than 800. This could be for a number of reasons but 3 reasons that could be worth investigating would be: - borrowers with lower credit scores are more likely to be denied loans - people with high credit scores are less likely to borrow money - the majority of people have credit scores between 600 and 800 and this is why there are more loans established

Unfortunately more data would be required to investigate this further. The distribution of credit scores from a random sample of the general population and information on applicants who were denied a loan would be required to investigate further.

3.4 Bank Card Utilization vs Borrower Rate

cor.test(loan$BankcardUtilization, loan$BorrowerRate)
## 
##  Pearson's product-moment correlation
## 
## data:  loan$BankcardUtilization and loan$BorrowerRate
## t = 86.168, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2498551 0.2610917
## sample estimates:
##      cor 
## 0.255482

There is a weak positive correlation (~0.25) between Bankcard Utilization and Borrower Rate. The plot seems to a positive correlation as expected but it is far too difficult to see. Again adding summary lines will make the trends clearer.

The summary lines show a much clearer positive correlation. They also show why the correlation is weak even though the positive trend is continuous. The variation is quite high as show by the 10% and 90% quantile lines. Even though the trend is upwards, the high degree of variation seems to cause the correlation to be low.

3.5 Available Bankcard Credit vs Borrower Rate

cor.test(loan$AvailableBankcardCredit, loan$BorrowerRate)
## 
##  Pearson's product-moment correlation
## 
## data:  loan$AvailableBankcardCredit and loan$BorrowerRate
## t = -119.44, df = 106390, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3491486 -0.3385518
## sample estimates:
##        cor 
## -0.3438611

The negative correlation between Available Bankcard Credit and Borrower Rate is clearly shown by the summary line. In this case it appears to be an exponentially decreasing trend. At low values of Available Bankcard Credit, the Borrower Rate falls quickly. Between $25,000-50,000 of credit, the trend seems to level off and become far more volatile. This would seem to indicate that having at least $25,000 to $50,000 of Available Credit would be helpful in getting a lower Borrower Rate but beyond that, other factors take over.

3.6 Home Ownership vs. Borrower Rate

Home ownership appears to impact interest rates. Those who own their home have lower median and first and third quartile Borrower Rates. There are also no high end outliers for homeowners and no rates above ~36% while non-homeowners show several extreme rates between 35-50%.

3.7 Employment Status vs. Borrower Rate

As is expected, the box plot shows that those who are unemployed have the highest Borrower Rates. Self employed people have higher rates than those employed full or part time. Those who are retired seem to have slightly higer rates than full or part time employed individuals, but lower that those who are self employed. Adding the scatter plot with alpha, color and jitter shows how many data points there are in each employment category. There are far fewer loans given to unemployed, retired or part time individuals. Most loans are made to those who indicate that they are employed, employed full-time or are self employed.

3.8 Reason for Loan vs. Borrower Rate

There seems to be a faily large difference in the median interest rate depending on what the reason for the loan is. Personal loans, boat purchases, baby & adoption and debt consolidation appear to have the lowest borrower rates. Household expenses and cosmetic procedures have the highest rates. The overlay of the scatterplot also shows which reasons are more common for loans. Debt consolidation, business and home improvement are the most common reasons listed. Boats, RVs, Green Loans and cosmetic procedures are the least frequently listed reasons. This variable would be useful to overlay as a third factor in the multivariate section.

3.9 Occupation vs. Borrower Rate

The chart shows a fairly clear trend that professional, high education jobs (judge, engineer, pharmacist, professor) tend to have lower borrower rates while lower skill jobs with lower educational requirements (teachers aide, bus driver, clerical, student, flight attendant) tend to have higher borrower rates.

4 Multivariate Exploration

In this section, more that 2 variables will be investigated to see how they interact. The existing correlations that have been uncovered will be expanded on to try to create plots that provide more insight.

4.1 Credit Score, Bankcard Utilization and Borrower Rate

The plot seems to show that those with lower Bankcard Utilization have higher credit scores and pay lower Borrower Rates. Those with high Bankcard Utilization tend to be on the left hand side of the plot with the lower credit scores and the higher borrower rates.

4.2 Credit Score, Home Ownership and Borrower Rate

The plot shows that those who own their homes appear to have a higher credit score and appear to have lower borrower rates, especially as credit score increases.

4.3 Credit Score, Employment Status and Borrower Rate

This plot seems to show that for low credit scores, the borrowers employment status does not have a clear effect on the borrower rate. However, it appears that once the credit score reaches around 650, the borrower rate for full-time employed borrowers tends to decrease while the same is not true for those who are not employed. The next plot zooms in on the credit score 650-800 section of the plot.

4.4 Available Bankcard Credit, Bankcard Utilization and Borrower Rate

The trendlines show that the optimal combination to get a low interest rate is to have a significant amount of available credit (more than $50,000) and to have a credit utilization of 25% or less.

4.5 Available Bankcard Credit, Current Delinquencies and Credit Score

This plot took several iterations. The first plot shows that the mean trendline requires some smoothing so Available Bank Credit is rounded to the nearest 100. Few borrowers have more than $10,000 credit so the plot is cut off at $10,000. The final plot switches to version of Current Delinquencies that has been cut into 4 buckets. In the second visual, the dark blue dots dominate the plot and this is because a vast majority of borrowers have 0, 1 or 2 delinquencies (see table(CurrentDelinquencies) in section 1). Cutting the Current Delinquencies variable into 4 buckets makes the plot clearer. Further analysis in section 5.

4.6 Credit Score, Listing Category (Reason for Loan) and Borrower Rate

Listing categories “Not Available”, Debt Consolidation, Business and Home Improvement have a high number of loans and median Borrower Rates on the lower side of half. Auto, “Other” and Hosehold Expenses have the highest number of loans with Borrower Rates on the higher side of half.

5 Final Plots and Summary

5.1 Plot One

Summaries and Correlations

summary(loan$ListingCategory)
##       4 - Personal Loan                9 - Boat       0 - Not Available 
##                    2395                      85                   16965 
##       8 - Baby&Adoption  1 - Debt Consolidation    14 - Large Purchases 
##                     199                   58308                     876 
##            3 - Business         5 - Student Use    2 - Home Improvement 
##                    7189                     756                    7433 
##         16 - Motorcycle        12 - Green Loans    11 - Engagement Ring 
##                     304                      59                     217 
##                6 - Auto              18 - Taxes           19 - Vacation 
##                    2572                     885                     768 
##      20 - Wedding Loans     15 - Medical/Dental                 17 - RV 
##                     771                    1522                      52 
##                7- Other 13 - Household Expenses 10 - Cosmetic Procedure 
##                   10494                    1996                      91
summary(loan$BorrowerRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975
summary(loan$CreditScoreRangeLower)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591
cor.test(loan$BorrowerRate, loan$CreditScoreRangeLower)
## 
##  Pearson's product-moment correlation
## 
## data:  loan$BorrowerRate and loan$CreditScoreRangeLower
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

Description One

This plot reveals some very interesting insights. The “Not Available” listing category shows a very high number of loans and a high number of loans given to borrowers with low credit. The blue coloring corresponds to a low interest rate and the plot clearly shows the most blue coloring on the “Not Available” box plot. This is a strange finding as not giving a reason for a loan could be seen as higher risk, yet low credit scores are getting more reasonable rates when compared with other categories. Having more detailed knowledge about the industry and why this trend may exist would shed some light. Discussing the trend with someone in the industry would be helpful in this scenario. Removing the “Not Available” loans from the data set would likely yeild higher correlations between other variables as that listing category likely skews the dataset for those with low credit scores.

5.2 Plot Two

Summaries and Correlations

summary(loan$BankcardUtilization)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604
cor.test(loan$CreditScoreRangeLower, loan$BankcardUtilization)
## 
##  Pearson's product-moment correlation
## 
## data:  loan$CreditScoreRangeLower and loan$BankcardUtilization
## t = -144.58, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4103489 -0.4003027
## sample estimates:
##       cor 
## -0.405338
cor.test(loan$BorrowerRate, loan$BankcardUtilization)
## 
##  Pearson's product-moment correlation
## 
## data:  loan$BorrowerRate and loan$BankcardUtilization
## t = 86.168, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2498551 0.2610917
## sample estimates:
##      cor 
## 0.255482

Description Two

The plot shows that those with lower Bankcard Utilization have higher credit scores and pay lower Borrower Rates. The blue is clearly concentrated in the bottom lower right corner of the plot. Those with high Bankcard Utilization tend to be on the left hand side of the plot with the lower credit scores and the higher borrower rates. On the far left side of the plot it shows that there are a few borrowers with very low Bankcard Utilization, low credit and higher borrower rates. This may show borrowers with very little credit or no credit and it is showing their utilization as 0.

5.3 Plot Three

Summaries and Correlations

summary(loan$CurrentDelinquencies)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.5921  0.0000 83.0000     697
cor.test(loan$CurrentDelinquencies, loan$CreditScoreRangeLower)
## 
##  Pearson's product-moment correlation
## 
## data:  loan$CurrentDelinquencies and loan$CreditScoreRangeLower
## t = -133.37, df = 113240, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3734729 -0.3634055
## sample estimates:
##      cor 
## -0.36845
cor.test(loan$CurrentDelinquencies, loan$AvailableBankcardCredit)
## 
##  Pearson's product-moment correlation
## 
## data:  loan$CurrentDelinquencies and loan$AvailableBankcardCredit
## t = -30.276, df = 106370, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09838769 -0.08647140
## sample estimates:
##         cor 
## -0.09243285

Description Three

The mean trendline shows that credit score increases as available credit increases. Layering on current delinquencies shows that those with more than 1 delinquency tend to have very poor credit and tend not to have much credit (blue and purple dots are clustered in the lower left corner between 500-600 credit score and less that $2500 available credit.) This makes sense as individuals with poor credit are not likely to have new loans approved, and those with current delinquencies will have poor credit and will also be rejected for new loans.


6 Reflection

The loan dataset has many variables and many of these variables are cross correlated. This makes it very hard to determine cause and effect. Throughout the investigation I tried to answer two questions: - What factors affect interest rate? - What factors affect credit score?

The assumption that I had was that a higher credit score will yield a lower interest rate. This correlation was true however credit score was only a one factor that predicted the interest rate. Several other factors played into the interest rate.

The issue with multivariate plots is that all three variables tend to be correlated which means that you may be showing redundent information. For example, an individual with high available credit is more likely to have a good credit score and also more likely to have a lower interest rate. If I were to repeat the analysis or add to it later, I would look for variables that appear to have no correlation with credit score but do seem to correlate with interest rate. This may yield a multivariate plot that has more explanatory power or a clearer trend because both the x axis variable and the color variable are adding different information to the plot (when they are correlated with each other they are really just displaying somewhat redundant information).